In recent years, data visualization has become essential for any application that involves data. There are at least two ways in which being able to visualize data effectively can help us:
R is pretty good at this. It has a good graphics engine that allows us
to make some very good-looking plots (compare them with Excel, Stata or
SPSS plots and you will see what I mean). Moreover, the package
ggplot2, first developed in the mid-2000s and since then one of the most
widely used tools to visualize data, makes it very easy to produce
nice-looking charts that convey complex information in a relatively
simple way.
Today we will do several things. We will play around with the
plotting functions available in base R. Then we'll start
using ggplot2. In both cases, as usual, we are going to see
some examples; however, keep in mind that the possibilities of visualizing
data go well beyond what we see here. Finally, we will see an application
to a topic model.
First of all, let's load the data. To begin we are going to use two datasets:

- cses2018itaRAW: an "uncleaned" version of a survey dataset collected after the 2018 elections in Italy for the CSES project
- vdem_small: a very small subset of the V-Dem (Varieties of Democracy) dataset, which includes only the EU-28 countries from 1990 to 2018

# In case you do not have `rio` installed:
# install.packages("rio", dependencies = T)
library(rio)
setwd("/folder/where/you/keep/the/data")
# Import the CSES data
cses <- import("cses2018itaRAW.xlsx")
# Import the V-Dem data
vdem <- import("vdem_small.dta")
head(vdem)

## country year liberal free_exp gdp_pc_log postcom
## 1 Austria 1990 0.916 0.976 10.09307 0
## 2 Austria 1991 0.916 0.976 10.14471 0
## 3 Austria 1992 0.916 0.976 10.17072 0
## 4 Austria 1993 0.916 0.976 10.18203 0
## 5 Austria 1994 0.916 0.976 10.23265 0
## 6 Austria 1995 0.912 0.976 10.27291 0
The V-Dem dataset is useful because it introduces a time dimension that we might encounter in our data from time to time. The same 28 countries are observed at up to 29 points in time (from 1990 to 2018), so we have two dimensions of variation: (1) between-country variation, and (2) within-country variation over time. The dataset contains the following variables:
- country: the name of the country (all the EU-28 countries)
- year: the year of observation (from 1990 to 2018)
- liberal: the V-Dem index of liberal democracy, which emphasizes "the importance of protecting individual and minority rights against the tyranny of the state and the tyranny of the majority". It ranges from 0 (low protection) to 1 (high protection)
- free_exp: the V-Dem index of free expression, emphasizing to what extent the government respects "press and media freedom, the freedom of ordinary people to discuss political matters at home and in the public sphere, as well as the freedom of academic and cultural expression". It goes from 0 (low freedom) to 1 (high freedom)
- gdp_pc_log: the logged GDP per capita
- postcom: a dummy taking value 1 for countries in the former communist bloc, and 0 for the others

The CSES data contain the following variables:

- id: respondent identifier
- sex: sex of the respondent
- eta: age of the respondent
- eta_gr: age of the respondent, in 3 categories
- mip1, mip2: the two answers to the open-ended question "In your opinion, what are the two most important problems facing Italy at the moment?"
- eco_eval: respondent's evaluation of the economic situation in the country over the last year (3 values: 1 = improved; 0 = remained the same; -1 = worsened)
- area: the geographic area of residence

Even using base R, there are a lot of things that we can do to visualize data. However, many of them can be done much more effectively with ggplot2. So here we will look at some basic functions that we can use for a quick-and-dirty inspection of the data.
The first function we can use is hist(), which produces a histogram.

hist(vdem$gdp_pc_log)

Note that we can change a few parameters of the chart. For instance, in
the chart below we color the bars in blue, we add a label to the x-axis,
and we remove the main title.

hist(vdem$gdp_pc_log, col = "blue",
xlab = "Logged GDP per capita", main = "")

Another basic function is boxplot().

boxplot(vdem$gdp_pc_log)

Boxplots can also show a bivariate relationship between two
variables. For instance, we may want to look at the distribution of the
variable gdp_pc_log for each year in which the countries
were observed in the data. In this case we input a formula written as
y ~ x, where y is the variable whose distribution we want to see
(in our case gdp_pc_log) and x is the grouping factor (in our case year).
We can also make the plot horizontal.
boxplot(gdp_pc_log ~ year, data = vdem,
horizontal = T, # Make a horizontal boxplot
las = 2) # Set the direction of the value labels on the axes

We could also use country as a grouping factor.

boxplot(gdp_pc_log ~ country, data = vdem,
horizontal = T,
las = 2)

However, this confronts us with two problems: (1) the country names do not fit in the default left margin of the figure, and (2) the countries are sorted in alphabetical order, which makes comparisons across them harder than they need to be.
Let's start with the first issue. In R, we can change the margins of
a figure by tweaking the default graphical parameters that are used
by R to produce any chart in the current session. We can see them (and
change them) by using the function par(). The parameter
we are interested in controls the figure margins and is called
mar. We can see its default values by typing the
following:

par()$mar

## [1] 5.1 4.1 4.1 2.1
Now, the values refer to the distance between the margins of the figure and the margins of the plot area. The order in which they are saved is bottom, left, top, right. As you can see, there is a bit more space at the bottom (to accommodate the x-axis label and the value labels), on the left (to accommodate the y-axis label and the value labels), and on top (to accommodate the main title).
In our case, we need more space on the left, so we can change the values accordingly. However, the values will be changed for the entire session, so we had better save the default values and restore them after we have made the plot.
# Save the default graphical parameters
pdefault <- par()
# Change the margins to have more space on the left
par(mar = c(5, 8, 4, 2))
# Make the boxplot
boxplot(gdp_pc_log ~ country, data = vdem,
horizontal = T,
las = 2,
ylab = "")

# Restore the default parameters (will return a warning,
# because par() also stores some read-only parameters)
par(pdefault)

The second issue is less technical and more substantive. The idea of making a boxplot by country is to compare countries with respect to their GDP per capita. However, if the countries are sorted in alphabetical order, the comparison is not as easy as it could be. The information is all there, but it is not organized in a way that makes the relevant message emerge: namely, which countries have a systematically higher GDP per capita, and which ones a systematically lower one, over the period observed. One way to make this piece of information emerge is to sort the countries by their median GDP per capita. This way the pattern of between-country variation becomes much clearer.
To do so, we need to create a new factor variable for the
country, with the levels sorted by the median of
gdp_pc_log within each country. To understand how this
works, we need to dig a bit into the properties of factor variables in
R.
Let's make an example by creating a fake data frame with a nominal
variable var.

fda <- data.frame(var = c("a", "b", "c"))

Since R 4.0, data.frame() no longer converts character variables to factors by default (it used to). We can ask R explicitly whether var is a factor:

is.factor(fda$var)

## [1] FALSE
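The FALSE result reflects the R 4.0 change of defaults: data.frame() no longer converts character vectors to factors. The old behavior can still be requested explicitly; a quick sketch:

```r
# Pre-4.0 behavior, requested explicitly via stringsAsFactors:
fda2 <- data.frame(var = c("a", "b", "c"), stringsAsFactors = TRUE)
is.factor(fda2$var)  # TRUE
```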
Factor variables have a distinctive property: they are labeled numeric
variables, i.e. their true values are numeric (integer), and each
numeric value is a placeholder that makes sense only when associated
with a label (which, in our case, would be the letters "a", "b" and "c"). The
labels associated with the unique values of a factor variable are called
"levels". Here var is still a character vector, which we can see by
printing it (note the quotes):

fda$var

## [1] "a" "b" "c"
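To see the integer-plus-levels representation concretely, here is a small self-contained sketch (separate from fda above):

```r
# A factor is an integer vector plus a "levels" attribute
f <- factor(c("b", "a", "b"))
levels(f)      # "a" "b" -- levels are sorted alphabetically by default
as.integer(f)  # 2 1 2   -- the underlying integer codes
```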
When R uses the variable as a grouping factor (converting a character
vector to a factor on the fly if needed), it will by default use the
alphabetical order of the levels. This can be seen by
making a table of var:

table(fda$var)

##
## a b c
## 1 1 1
However, we can change the order in which the levels of the factor
appear in the output (in this case the table) by changing this attribute
of the variable itself. For instance, if we want to change the order
from c("a", "b", "c") to c("b", "c", "a"), we
can do the following:

fda$var <- factor(fda$var, levels = c("b", "c", "a"))

Now if we ask for a table of var, the values will be
displayed in the order that we set:

table(fda$var)

##
## b c a
## 1 1 1
We can take advantage of this property when we want to order nominal
categories in a plot. In our case, we need to order the levels of the
variable country by the median of gdp_pc_log.
We can do this automatically using the function reorder().
We do it by creating a new variable called
country_sort.
vdem$country_sort <- with(vdem, reorder(country, gdp_pc_log, function(x) median(x, na.rm = T)))
# Let's make a table of the first 100 observations in the data to see whether it worked
with(vdem[1:100, ], table(country_sort, country))

## country
## country_sort Austria Belgium Bulgaria Cyprus
## Romania 0 0 0 0
## Bulgaria 0 0 29 0
## Latvia 0 0 0 0
## Lithuania 0 0 0 0
## Poland 0 0 0 0
## Estonia 0 0 0 0
## Croatia 0 0 0 0
## Slovakia 0 0 0 0
## Hungary 0 0 0 0
## Czech Republic 0 0 0 0
## Portugal 0 0 0 0
## Malta 0 0 0 0
## Slovenia 0 0 0 0
## Greece 0 0 0 0
## Cyprus 0 0 0 13
## Spain 0 0 0 0
## France 0 0 0 0
## Italy 0 0 0 0
## Finland 0 0 0 0
## Belgium 0 29 0 0
## Germany 0 0 0 0
## United Kingdom 0 0 0 0
## Austria 29 0 0 0
## Denmark 0 0 0 0
## Sweden 0 0 0 0
## Netherlands 0 0 0 0
## Ireland 0 0 0 0
## Luxembourg 0 0 0 0
Now we can make our boxplot using the (ordered) variable
country_sort instead of the (unordered) variable
country.
par(mar = c(5, 8, 4, 2))
boxplot(gdp_pc_log ~ country_sort, data = vdem,
horizontal = T,
las = 2,
ylab = "")
par(pdefault)

This way, the plot is much more informative than with the unordered countries.
For scatter plots we can use the function plot(). Also, since we would
have to write the name of the dataset twice, once before each variable
to identify it with the dollar sign $, we can use the function with()
to specify up front the dataset from which we are picking both variables.

with(vdem, plot(x = liberal, y = gdp_pc_log))

This can also be written using a formula (with y being
gdp_pc_log and x being liberal):

# This will produce the same plot as above
plot(gdp_pc_log ~ liberal, data = vdem)

These were just a few examples; we can actually make a lot of different charts with base R, but the real deal comes now.
ggplot2

While base R allows us to do many things, some packages let us
visualize very complex information in a relatively easy way.
The most widely used is surely ggplot2. Like many other
packages we have seen in the past days, ggplot2 is part of the
tidyverse. This means that it is very well integrated with
other tidyverse routines, most notably dplyr.
To take advantage of this integration, instead of loading
ggplot2 alone, we will load the full tidyverse
collection.

# In case you do not have `tidyverse` installed:
# install.packages("tidyverse", dependencies = T)
library(tidyverse)

The syntax of ggplot2 works in a slightly different way
than the one of base R. In a way, it is closer to the logic of piped
expressions in dplyr (although it is not the same thing).
The main function to produce a chart is called (unsurprisingly)
ggplot(). The information the function needs
is pretty much the same as what we would give a base R plotting
function: the dataset containing the variables we want to plot, and the
variables themselves. The latter go into a specific helper function
called aes().
Note that the function ggplot() is used to produce
any chart that we want to make with ggplot2. This
clearly implies that we need to add somewhere the information about what
kind of plot we want to make (e.g. a scatter plot, a box plot, etc.).
This is done in ggplot2 by means of adding
layers to the plot. The layers are pieces of
information that we integrate in the existing plot. They may contain
more data, or just the geometric pattern that we want to
display in the chart (which makes the plot type). In fact, without
layers there is no chart in ggplot2. If we try to make a
plot without layers nothing will come out.
ggplot(data = vdem, aes(x = liberal, y = gdp_pc_log))

Layers in ggplot2 are added by using the set of
functions geom_*. For instance, if we want to add
points to the plot (therefore making a scatter plot) we
can add the layer geom_point(). However, in
ggplot2 syntax, this is not done within the
ggplot() function, but it is chained after the initial call
by literally adding the layer using the plus +
sign.
ggplot(vdem, aes(x = liberal, y = gdp_pc_log)) +
geom_point()

The layers are used for everything, from adding axis labels to
plotting more data. Inside the layer we can change the options for that
specific graphical parameter, but we can also specify the source data
and variables to plot in the layer. For instance, the very same chart
above could also be produced putting all the information within the
geom_point() function:
# Same as above, run to see
ggplot() +
geom_point(data = vdem, aes(x = liberal, y = gdp_pc_log))

A few examples:
ggplot(vdem, aes(x = gdp_pc_log)) +
geom_histogram()

ggplot(vdem, aes(x = gdp_pc_log)) +
geom_histogram(alpha = 0.2, col = "black", fill = "blue") +
xlab("Log GDP per capita") +
ylab("Count") +
theme_bw()

ggplot(vdem, aes(x = reorder(country, gdp_pc_log, function(x) mean(x, na.rm = T)), y = gdp_pc_log)) +
geom_violin(fill = "gray50", trim = F, scale = "width") +
geom_boxplot(fill = "white", width = 0.2) +
coord_flip() +
ylab("Log GDP per capita") +
xlab("Country") +
theme_bw()

The vdem data allow us to view time trends of some
specific variables. For instance, we can see how the “liberal democracy
index” changed in Italy over time. This would be a
bivariate plot, as we are looking at the relationship
between one phenomenon (time) and another (liberal democracy in a
country).
ggplot(filter(vdem, country == "Italy"), aes(y = liberal, x = year)) +
geom_line() +
geom_point() +
ylab("Liberal democracy index") +
xlab("Year") +
scale_x_continuous(breaks = seq(1990, 2018, by = 2)) +
theme_bw()

We can add a third variable (effectively showing a multivariate relationship) by breaking down the relationship between two variables country by country. We can do so by adding a grouping factor and using colors to identify the different countries:
ggplot(vdem, aes(y = gdp_pc_log, x = year, group = country)) +
geom_line(aes(col = country)) +
ylab("Log GDP per capita") +
xlab("Year") +
scale_x_continuous(breaks = seq(1990, 2016, by = 2)) +
theme_bw()

However, with 28 countries it gets very hard to understand anything.
As an alternative, we can show the pattern country by country by
faceting the plot with facet_wrap():
ggplot(vdem, aes(x = year, y = gdp_pc_log)) +
geom_line() +
facet_wrap(~country, ncol = 5) +
ylab("Log GDP per capita") +
xlab("Year") +
scale_x_continuous(breaks = seq(1990, 2020, by = 5)) +
theme_bw()

Going back to the trend of liberal over time in Italy,
next to (or rather than) plotting the individual points connected by a
line, we can add a smoothing line to make the pattern
emerge more clearly. This is one step towards fitting a regression
model, and it is done with the function geom_smooth(). If
we do not specify anything, the function fits a non-parametric
curve to interpolate the points: a loess when there are few
observations (as in our case), or a GAM when there are more.
ggplot(filter(vdem, country == "Italy"), aes(y = liberal, x = year)) +
geom_point() +
geom_smooth() +
ylab("Liberal democracy index") +
xlab("Year") +
scale_x_continuous(breaks = seq(1990, 2018, by = 2)) +
theme_bw()

Of course, this can be done by group as well:
ggplot(vdem, aes(y = gdp_pc_log, x = year, group = country)) +
geom_smooth(aes(col = country), se = F) +
ylab("Log GDP per capita") +
xlab("Year") +
scale_x_continuous(breaks = seq(1990, 2016, by = 2)) +
theme_bw()

geom_smooth() can also be used to fit a regression line
in the plot by adding the option method = "lm". For
instance, we can look at the linear relationship between the “Freedom of
expression” index (the variable free_exp) and the “Liberal
democracy” index.
ggplot(vdem, aes(x = free_exp, y = liberal)) +
geom_point() +
geom_smooth(method = "lm", col = "red") +
ylab("Liberal democracy index") +
xlab("Freedom of expression index") +
theme_bw()

ggplot2 with dplyr

Being part of the tidyverse, ggplot2 is
very well integrated with the other packages in the ecosystem. This
allows for instance for a very easy concatenation of dplyr
functions (to modify or summarize the data) and ggplot2
functions (to visualize them).
For instance, we may want to see whether from 1990 there has been a
convergence in terms of liberal democracy between Western-European
countries and post-communist countries in the Central-Eastern European
bloc. To do so we need first to summarize the variable
liberal by group (using as grouping factor the dummy
postcom) and plot the two different groups of countries
against each other. By using the pipe operator %>% we
can chain all the data manipulation operations and the plot directly. We
only need to remember to switch from the pipe (for the
dplyr part) to the plus + sign (for the
ggplot2 part).
vdem %>% # Data manipulation
group_by(year, postcom) %>%
summarize(libm = mean(liberal, na.rm = T)) %>%
mutate(
gr = ifelse(postcom == 0, "West\nEurope", "Central-Eastern\nEurope")
) %>%
ggplot(., aes(x = year, y = libm, group = gr)) + # Plotting
geom_smooth(aes(col = gr)) +
scale_color_manual(values = c("red", "blue"), "Geo-political area") +
scale_x_continuous(breaks = seq(1990, 2018, by = 2)) +
xlab("") +
ylab("Liberal democracy index") +
theme_bw() +
theme(legend.position = "bottom")

Another example: using the cses data, we want to see
the share of respondents in each geographical area giving each
evaluation of the economy over the year before the (2018)
elections. We can visualize this with a bar plot (as we saw before),
built in two ways: "stacked" or "dodged". We are going to check them
both out.
# Data wrangling
ecodat <- cses %>%
filter(!is.na(eco_eval)) %>%
group_by(area, eco_eval) %>%
summarize(nobs = n()) %>%
group_by(area) %>%
mutate(
share = nobs / sum(nobs),
gr = ifelse(eco_eval == -1, "Got worse",
ifelse(eco_eval == 0, "Remained\nthe same",
"Got better"))
) %>%
ungroup() %>%
mutate(
area = factor(area, levels = rev(c("Nord-Ovest", "Nord-Est", "Centro", "Sud"))),
gr = factor(gr, levels = c("Got worse", "Remained\nthe same", "Got better"))
)
# Stacked bar chart
ggplot(ecodat, aes(x = area, y = share, fill = gr)) +
geom_bar(stat = "identity", position = "stack") +
scale_fill_manual(values = c("gray20", "gray50", "gray80"),
"Evaluation of the economy\nover the last year") +
scale_y_continuous(breaks = seq(0, 1, by = 0.2),
labels = scales::percent) +
ylab("Share of respondents") +
xlab("Geographical area") +
ggtitle("Stacked bar chart") +
theme_bw()

# Dodged bar chart
ggplot(ecodat, aes(x = area, y = share, fill = gr)) +
geom_bar(stat = "identity", position = "dodge", col = "black") +
scale_fill_manual(values = c("gray20", "gray50", "gray80"),
"Evaluation of the economy\nover the last year") +
scale_y_continuous(breaks = seq(0, 1, by = 0.2),
labels = scales::percent) +
ylab("Share of respondents") +
xlab("Geographical area") +
ggtitle("Dodged bar chart") +
theme_bw() +
theme(legend.position = "bottom")

Once we have the plots, we may want to save them, so that we can put them in a paper, report, website, or wherever else we need them. In RStudio we can simply use the "Export" button above the bottom-right panel, but we might want to export plots directly from the script, so that we can automate the process (for instance, if we have many plots to make) and have more control over parameters such as the figure size.

In ggplot2, we can do this using the function ggsave(). We can use it in two ways:

1. Save the plot in an object, then pass that object to ggsave().
2. Pipe the plot directly into ggsave() right after making it.

# Save the plot in an object
plot1 <- ggplot(ecodat, aes(x = area, y = share, fill = gr)) +
geom_bar(stat = "identity", position = "dodge", col = "black") +
scale_fill_manual(values = c("gray20", "gray50", "gray80"),
"Evaluation of the economy\nover the last year") +
scale_y_continuous(breaks = seq(0, 1, by = 0.2),
labels = scales::percent) +
ylab("Share of respondents") +
xlab("Geographical area") +
theme_bw() +
theme(legend.position = "bottom")
# Save the plot
ggsave(plot1, filename = "eco_barplot.pdf",
device = cairo_pdf, width = 6, height = 5)

Note: if we put a plot into an object and just save it, we will of course not see the plot on screen. To display it, we need to call the object explicitly.
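For instance, a sketch using the built-in mtcars data rather than our objects above:

```r
library(ggplot2)
# A stored ggplot is only drawn when the object is printed
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()  # nothing drawn yet
print(p)  # explicit print() draws it (typing `p` at the console does the same)
```

Inside scripts, loops, or functions, the explicit print() is required; auto-printing only happens at the top level of the console.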
For the second way, we pipe into ggsave() directly after making the plot,
using the pipe %>%. We need to put an additional pair of parentheses
around the whole group of functions making the plot. The code below will
likewise not display the plot.

(ggplot(ecodat, aes(x = area, y = share, fill = gr)) +
geom_bar(stat = "identity", position = "dodge", col = "black") +
scale_fill_manual(values = c("gray20", "gray50", "gray80"),
"Evaluation of the economy\nover the last year") +
scale_y_continuous(breaks = seq(0, 1, by = 0.2),
labels = scales::percent) +
ylab("Share of respondents") +
xlab("Geographical area") +
theme_bw() +
theme(legend.position = "bottom")) %>%
ggsave(., filename = "eco_barplot.pdf",
device = cairo_pdf, width = 6, height = 5)

Obviously, by the same logic we can pipe everything together: data preparation, visualization, and exporting. Below is an example of a scatter plot with labels and grid faceting (the plot is shown here for convenience, even though this operation saves it directly to a file without displaying it).
By the way, when we put labels on points they might overlap, and
this is bad. Let's use the package ggrepel to arrange them
automatically so that they do not overlap. However, there are too many
labels here, and ggrepel will drop some of them
(returning warning messages).

# Remember to install the package
# install.packages("ggrepel")
library(ggrepel)
(vdem %>%
mutate(
decade = ifelse(year < 2000, "1990s",
ifelse(year < 2010, "2000s", "2010s")),
gr = ifelse(postcom == 0, "West\nEurope", "Central-Eastern\nEurope"),
gr = factor(gr, levels = c("West\nEurope", "Central-Eastern\nEurope"))
) %>%
group_by(country, decade) %>%
summarize(
gr = unique(gr),
liberal = mean(liberal, na.rm = T),
gdp = mean(gdp_pc_log, na.rm = T)
) %>%
ggplot(., aes(x = gdp, y = liberal)) +
geom_smooth(method = "lm", se = F, col = "blue") +
geom_point() +
geom_text_repel(aes(label = country)) + # The function from `ggrepel`
facet_grid(gr ~ decade) +
xlab("Log GDP per capita") +
ylab("Liberal democracy index") +
theme_bw()) %>%
ggsave(., filename = "scatter_gdp_liberal.pdf",
device = cairo_pdf, width = 10, height = 7)## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider increasing max.overlaps
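As the warnings suggest, one option is to raise (or disable) the overlap cap via the max.overlaps argument of geom_text_repel(); a sketch on built-in data (expect clutter):

```r
library(ggplot2)
library(ggrepel)
# max.overlaps = Inf keeps every label instead of silently dropping some
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_text_repel(aes(label = rownames(mtcars)), max.overlaps = Inf)
```

Whether keeping all labels is worth the visual noise depends on the plot; for our faceted scatter above, dropping a few labels is arguably the better trade-off.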
Data visualization can also help us with our text analysis tasks. One domain where visual diagnostics are quite useful is topic modeling. Visualization helps at several points: (1) when we need to select the number of topics that the model is supposed to produce, (2) when we want to visualize the top words in each topic, and (3) when we want to see the distribution of the topics in the corpus.
We will do an easy example using the dataset called
FB_posts.xlsx, which is a sample of Facebook posts from the
official pages of 6 UK parties (Conservatives, Labour, LibDems, Brexit
Party, Greens and UKIP) from August 2019 to October 2020. The texts have
been cleaned a bit already, so we will only need to do limited
pre-processing (some standard operations).
Let’s start by installing the packages that we need:
quanteda (to make the corpus object, tokens and so on), and
its extension quanteda.textstats, tidytext
(another package to do text management), and stm to fit
structural topic models. The routine below will install the packages
that we need and load them automatically.
want = c("quanteda", "quanteda.textstats", "tidytext", "stm")
have = want %in% rownames(installed.packages())
if ( any(!have) ) { install.packages( want[!have] , dependencies = T) }
junk <- lapply(want, library, character.only = T)

## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:rio':
##
## convert
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
rm(have, want, junk)

We can now load the data and prepare it for the analysis.
# Import the data
fb_data <- import("FB_posts.xlsx")
# Make the corpus with "quanteda"
corp_uk <- corpus(fb_data, text_field = "message")
# Make tokens, further cleaning the text
tok_uk <- corp_uk %>%
tokens(remove_punct = T,
remove_symbols = T,
remove_numbers = T) %>%
tokens_remove(stopwords("en"))
# Look for relevant n-grams
ngr <- textstat_collocations(
tok_uk,
size = 2:3,
min_count = 10)
head(ngr)

## collocation count count_nested length lambda z
## 1 boris johnson 104 104 2 8.630141 22.09312
## 2 stop brexit 49 49 2 4.763997 21.51931
## 3 green party 33 33 2 5.131901 20.20941
## 4 join us 29 29 2 4.890953 19.02816
## 5 get brexit 39 39 2 3.997924 18.84504
## 6 brexit done 42 42 2 5.541543 18.83167
# Combine the tokens to get the n-grams that we found
tok_uk <- tokens_compound(
tok_uk,
phrase(ngr$collocation),
join = T
)
# Make Document-Feature Matrix, and trim it (select only very frequent words that discriminate well between texts)
dfm_uk <- dfm(tok_uk) %>%
dfm_trim(
min_termfreq = 0.9, termfreq_type = "quantile",
max_docfreq = 0.1, docfreq_type = "prop",
verbose = T
)

## Removing features occurring:
## - fewer than 9 times: 3,922
## - in more than 100 documents: 4
## Total features removed: 3,926 (89.7%).
# Remove the documents which have less than 3 words left
dfm_uk <- dfm_uk[rowSums(dfm_uk) > 2,]
dim(dfm_uk)

## [1] 895 452
stm requires the data in a slightly different format than a
quanteda DFM, so we need to convert the object to a
different type.

dfm_uk <- convert(dfm_uk, to = "stm")

stm has a nice function to help us find the right number
of topics, which is called searchK(). We need to specify a
range of topic numbers, and stm will fit a set of topic
models in that range and return some fit statistics that we can use to
assess the best number of topics given the data (this is a brute-force
quantitative approach that is no substitute for actually reading the
topics produced by the different solutions and figuring out which one
makes more sense).
Note: an advantage of Structural Topic Models is
that they allow us to include predictors in the
estimation, both to improve the quality of the estimates themselves and
to get some substantive insights. In this case, we will include in the
model the variable pt_week_y, a string that concatenates
the "year", "month" and "week" of the post. This includes in the
estimation (in an extremely overfitting manner) some information about
when the post was published, which should correlate with the
probability that the post is about some specific topic (e.g. the issues
of the day, but also self-promotion by parties). We will use this
variable only to improve the estimation; we will not inspect it
substantively.
Watch out when running the code below: it takes some time.
set.seed(1234)
stm_many <- searchK(
dfm_uk$documents,
dfm_uk$vocab,
K = seq(10, 200, by = 10),
prevalence =~ factor(pt_week_y),
data = dfm_uk$meta,
cores = 4
)
stm_many$results

## K exclus semcoh heldout residual bound lbound em.its
## 1 10 9.546842 -146.8258 -5.807072 1.386329 -48047.68 -48032.58 132
## 2 20 9.739671 -149.6311 -5.854532 1.301555 -46071.94 -46029.6 88
## 3 30 9.803465 -158.5489 -5.976493 1.319745 -45065.54 -44990.89 62
## 4 40 9.824982 -153.7451 -6.048548 1.465606 -44028.88 -43918.56 89
## 5 50 9.839221 -150.8483 -6.196674 1.794245 -43363.32 -43214.84 82
## 6 60 9.835363 -154.3851 -6.148196 2.55257 -42679.77 -42491.14 90
## 7 70 9.840677 -154.4908 -6.275768 4.716161 -42520.08 -42289.64 74
## 8 80 9.850034 -156.1086 -6.25317 32.8066 -42223.62 -41949.95 76
## 9 90 9.857129 -157.3209 -6.19464 -4.261304 -41943.56 -41625.41 77
## 10 100 9.865058 -152.1958 -6.149852 -1.926544 -41624.61 -41260.87 82
## 11 110 9.866419 -156.0532 -6.177197 -1.302 -41513.36 -41103.03 85
## 12 120 9.86994 -157.5743 -6.341506 -0.9103978 -41259.93 -40802.12 89
## 13 130 9.869962 -155.1936 -6.302026 -0.6663003 -41069.04 -40562.91 83
## 14 140 9.875517 -156.5658 -6.228148 -0.5466153 -41025.68 -40470.46 90
## 15 150 9.880368 -156.197 -6.231056 -0.4391988 -40891.74 -40286.72 83
## 16 160 9.883717 -155.282 -6.276008 -0.3654539 -40851.47 -40195.99 86
## 17 170 9.885801 -151.2771 -6.180262 -0.3126482 -40686.17 -39979.59 88
## 18 180 9.88696 -154.4574 -6.34932 -0.2616176 -40662.94 -39904.69 66
## 19 190 9.889069 -153.5413 -6.352825 -0.2362103 -40462.15 -39651.67 87
## 20 200 9.891776 -154.0574 -6.333194 -0.2139281 -40434.24 -39571.01 90
We can now extract the results and plot some metrics: semantic coherence, exclusivity, held-out likelihood, and the model residuals (the first three are to be maximized, the last is to be minimized).
stm_many$results <- stm_many$results %>%
mutate(
across(everything(), ~as.numeric(.))
)
# Let us make a plot with the most relevant metrics
stm_many$results %>%
select(K, exclus, semcoh, heldout, residual) %>%
pivot_longer(
cols = -K
) %>%
mutate(
name = recode(
name,
"exclus" = "1. Exclusivity",
"semcoh" = "3. Semantic Coherence",
"heldout" = "2. Held-out Likelihood",
"residual" = "4. Residuals"
)
) %>%
ggplot(aes(x = K, y = value)) +
geom_line(aes(col = name)) +
scale_color_discrete(guide = "none") +
facet_wrap(~name, ncol = 2, scales = "free_y") +
scale_x_continuous(breaks = seq(0, 200, by = 10)) +
theme_bw()

Hard to say; however, there do not seem to be too many topics in here, certainly no more than 100. Let us look at the trade-off between semantic coherence and exclusivity (the most meaningful metrics that we have here).
ggplot(stm_many$results, aes(x = semcoh, y = exclus)) +
geom_smooth(se = F) +
geom_point() +
geom_text_repel(aes(label = K)) +
xlab("Semantic coherence") + ylab("Exclusivity") +
theme_bw()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It looks like a good solution might be 50 topics. So let’s fit a
topic model (still including the variable pt_week_y as
predictor).
stm_fit <- stm(
dfm_uk$documents,
dfm_uk$vocab,
prevalence =~ factor(pt_week_y),
K = 50,
data = dfm_uk$meta,
seed = 1234,
verbose = F)

We can now look at (1) the most important words in each topic (to understand the "content", or meaning, of the topics) and (2) how prevalent each topic is in the corpus. For the first piece of information, we need to extract the word probabilities for each topic, which in STM language are the beta coefficients associated with each word. For the second, we need to extract the prevalence of each topic in the corpus: the gamma coefficients associated with each topic.
# Extract word probabilities for each topic (that `stm` calls "beta")
beta_stm <- tidytext::tidy(stm_fit, matrix = "beta")
# Extract topic probabilities in the corpus (that `stm` calls "gamma")
gamma_stm <- tidytext::tidy(stm_fit, matrix = "gamma")
# Clean "beta" data
top_terms <- beta_stm %>%
arrange(beta) %>%
group_by(topic) %>%
slice_max(beta, n = 7) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarize(terms = list(term)) %>%
mutate(terms = sapply(terms, paste, collapse = ", "))
# Clean "gamma" data
gamma_terms <- gamma_stm %>%
group_by(topic) %>%
summarize(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
# Let's put them together and make a barplot
gamma_terms %>%
top_n(20, gamma) %>%
ggplot(aes(x = topic, y = gamma, label = terms, fill = topic)) +
geom_col(show.legend = FALSE) +
geom_text(hjust = 0, nudge_y = 0.0005, size = 3) +
coord_flip() +
scale_y_continuous(expand = c(0,0),
limits = c(0, 0.09),
labels = scales::percent_format(1L)) +
theme_bw() +
xlab("") + ylab(expression(gamma))

And now it's our turn to interpret the results.